Method and system for thread monitoring

ABSTRACT

An apparatus and methods for hardware-based performance monitoring of a computer system are presented. The apparatus includes: processing units; a memory; a connector device connecting the processing units and the memory; probes inserted in the processing units, the probes generating probe signals when selected processing events are detected; and a thread trace device connected to the connector device. The thread trace device includes an event interface to receive probe signals, and an event memory controller to send probe event messages to the memory, where probe event messages are based on probe signals. The probe event messages transferred to memory can be subsequently analyzed using a software program to determine, for example, thread-to-thread interactions.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to performance monitoring in computer systems.

2. Background Art

Computer systems, for example, computer processors including central processor units (CPU) and graphics processors (GPU), are capable of executing increasing numbers of processing threads in parallel. The parallel execution of numerous threads can yield substantial increases in performance and overall efficiency in the computer system.

Debugging computer applications is complex. The complexity of debugging increases when the application concerned executes in an environment having multiple threads or processes. Multiple simultaneously executing threads can cause processing delays due to numerous issues such as thread synchronization, resource sharing, and resource contention. For example, designers of a GPU having multiple execution units may expect a particular level of performance based on the number of execution units, but some applications having a large number of parallel threads may yield a much lower level of performance due to thread interaction issues.

Conventionally, most processor and application designers have debugged issues such as thread interaction using instrumented code and/or performance counters. Instrumenting the code, in general, involves inserting additional statements in the code before and/or after selected processing steps. The additional statements usually are directed to steps such as incrementing or decrementing performance counters, or writing debug messages. In general, such additional statements increase the size of the executable code and slow the processing speed due to additional steps and output requirements. Therefore, although instrumenting the code allows many debugging issues to be resolved, the additional processing steps change the behavior of the application, and as a result many complex issues involving multiple threads may go undetected.

Performance counters are implemented by instrumenting the code and/or using hardware-based probes to increment and decrement a set of software counters or registers. Performance counters count the occurrences of each of a predetermined set of events. Unlike instrumented code, hardware-based probes can be inserted so as not to impact the general processing flow of the system.

In many computer systems, numerous performance counters are available. For example, performance counters may provide a count of the number of threads executing at a given time, the highest number of threads that were executing in parallel at any point during the execution of an application, and/or the highest level of memory usage during the execution of an application. However, performance counters, even when implemented using hardware-based probes, can provide only a view of system performance that is aggregated over defined time intervals. Performance counters cannot illustrate the interactions between any two threads that happen to be executing simultaneously.

In the case of both instrumented code and performance counters, the user is often left to trial and error to detect application issues while controlling the impact of additional debugging steps on application performance and interactions. For example, at some debugging levels, so many performance counters may be accessed or so many debug statements may be written that the memory input/output may be increased to a level that impacts the servicing of processing threads.

What is needed, therefore, is a hardware-based dynamic thread performance monitoring system that monitors the performance of the system without impacting the actual performance of applications.

BRIEF SUMMARY OF THE INVENTION

Apparatus and methods for hardware-based performance monitoring of a computer system are presented. In one embodiment, an apparatus for monitoring the performance of a computer system includes: one or more processing units; a memory; a connector device connecting the one or more processing units and the memory; one or more probes inserted in at least one of said processing units, and said one or more probes generating probe signals when predetermined processing events are detected; and a thread trace device connected to the connector device. The thread trace device includes an event interface configured to receive probe signals, and an event memory controller configured to send probe event messages to the memory, wherein probe event messages are based on probe signals.

In another embodiment, a method for monitoring performance of a computer system is presented. The method includes: inserting one or more event probes in one or more hardware-based processing units, where the event probes are configured to generate probe events when predetermined processing events are detected; configuring a hardware-based device to generate probe event messages based on said probe events; and transferring the probe event messages to a memory. The probe event messages transferred to memory can be analyzed using a software program.

Further embodiments, features, and advantages of the present invention, as well as the structure and operation of the various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES

The accompanying drawings, which are incorporated in and constitute part of the specification, illustrate embodiments of the invention and, together with the general description given above and the detailed description of the embodiment given below, serve to explain the principles of the present invention. In the drawings:

FIG. 1 is an apparatus to monitor the performance of a computer system, according to an embodiment of the present invention.

FIG. 2 shows the thread monitoring module of FIG. 1, according to one embodiment of the present invention.

FIG. 3 shows a typical sequence of thread events that may be monitored per thread in an embodiment of the present invention.

FIG. 4 shows a flowchart of steps generating probe events in hardware components according to an embodiment of the present invention.

FIG. 5 shows a flowchart of steps occurring in a thread trace module when probe events are collected according to an embodiment of the present invention.

FIG. 6 is a flowchart illustrating the dynamic control of the flow of probe event traffic between the monitor device and a memory controller, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

While the present invention is described herein with illustrative embodiments for particular applications, it should be understood that the invention is not limited thereto. Those skilled in the art with access to the teachings provided herein will recognize additional modifications, applications, and embodiments within the scope thereof and additional fields in which the invention would be of significant utility.

It would be apparent to one of skill in the art that the present invention, as described below, may be implemented in many different embodiments of hardware, firmware, software (which may include hardware description language code), and/or the entities illustrated in the figures. Any actual software code with the specialized control of hardware to implement the present invention is not limiting of the present invention. Thus, the operational behavior of the present invention will be described with the understanding that modifications and variations of the embodiments are possible, given the level of detail presented herein.

This disclosure presents systems and methods for hardware-based performance monitoring in a computer system. A person skilled in the art will recognize that the teachings provided herein may be applied to monitoring various aspects affecting the performance of a computer system, including but not limited to processes and threads. Without loss of generality, the computer system environment described in this disclosure consists primarily of a graphics processing unit (GPU) having multiple execution units, and implementing applications involving numerous simultaneous threads created by multiple processes including vertex shaders, geometry shaders, and pixel shaders.

FIG. 1 is a computer system 100 according to an embodiment of the present invention. Computer system 100 includes a control unit 102, one or more execution units 104, a local memory 106, a memory controller 108, a communications bus 110, and a thread trace device 112. Computer system 100, in this embodiment, may represent a graphics processing unit (GPU). In other embodiments, computer system 100 may represent, for example and without limitation, a central processing unit (CPU), a multiple processor device, a field programmable gate array (FPGA)-based processing device, a digital signal processor (DSP)-based processing device, or an application-specific integrated circuit (ASIC)-based processing device.

Control unit 102 may represent any processor capable of executing an application program. Based on the application, control unit 102 can create and distribute threads and processes and/or issue instructions to be processed by a plurality of processing units or execution units 104. In FIG. 1, four separate execution units are shown: an arithmetic and logic unit (ALU) 104 a, an ALU 104 b, a branch unit 104 c, and a texture fetch unit 104 d. For example, during the execution of an application process, control unit 102 may spawn multiple simultaneous threads.

During the execution of one or more of these threads, control unit 102 may issue ALU-specific instructions (such as, for example, arithmetic or movement instructions) to ALU 104 a and/or ALU 104 b, branching determinations to branch unit 104 c, and texture instructions to the texture fetch unit 104 d. Instructions issued to execution units 104 a-104 d can execute simultaneously, and each execution unit may notify control unit 102 when one or more assigned instructions are completed. Control unit 102 and/or execution units 104 may include internal registers, for example, to maintain performance counters for monitoring of events such as number of threads created, number of threads successfully completed, number of instructions of defined types, highest number of simultaneously active threads, etc.

Local memory device 106 may comprise one or more dynamic memory devices such as a random access memory (RAM). Local memory 106 may be utilized by control unit 102 and execution units 104 to store and retrieve instructions and/or data. For example, instructions may be allocated among execution units by control unit 102 writing those instructions to a predetermined location in local memory 106 and execution units being notified in some manner such as an interrupt. Local memory 106 may also, for example, be used to exchange data between threads and between execution units 104 and control unit 102. Local memory 106 may also include memory used as registers to maintain various performance counters.

Memory controller device 108 coordinates access to local memory 106. In some embodiments, memory controller device 108 may also coordinate access to an external memory (not shown). For example, when control unit 102 and execution unit 104 simultaneously request to write some data to local memory 106, memory controller 108 coordinates the writing of that data to memory 106. The communication between devices requiring access to memory 106 and memory controller 108 may use messages exchanged via communications bus 110 or some other mechanism such as interrupts.

Communications bus (or system bus) 110 may be any device interconnecting mechanism such as, but not limited to, peripheral component interconnect (PCI). A person skilled in the art will understand that a multitude of technologies can be used for communications bus 110. Communications bus 110, directly or indirectly, interconnects devices 102, 104, 106, 108, and 112. Depending on the communications protocol used to interconnect various devices over communications bus 110, the processing capacity of the computer system 100 may be affected by the capacity of communications bus 110 to transfer instructions and data between devices interconnected to it.

Thread trace device 112 is configured to detect and collect predetermined event types that occur in some or all of the devices of computer system 100, including devices 102, 104, 106, 108, and 110. For example, thread trace device 112 can monitor a predefined set of probes and collect data whenever those probes are triggered. In the embodiment illustrated in FIG. 1, probes can be implemented in one or more devices including control unit 102 and execution units 104. Probes can also be inserted to monitor traffic on communications bus 110, and to monitor activity in memory controller 108. Thread trace device 112 can monitor one or more probes, collect data, perform filtering of the probe data according to user and system requirements, and transfer the data to memory so that the data can be analyzed using a separate software module (not shown).

Thread trace device 112 can also actively monitor the system performance and dynamically reconfigure the collection and transfer of probe data so that the collection and transfer of probe data does not significantly affect system performance. Thread trace device 112 can generally be implemented as one or more separate circuits interconnected to the rest of the computer system 100. Probes can be implemented in many ways, including using circuitry that generates appropriate signals to thread trace device 112 by monitoring registers at regular clock intervals.

FIG. 2 is an illustration of thread trace device 112 according to one embodiment. Thread trace device 112 can include an event interface device 202, an event filter device 204, a configuration interface device 206, a timestamper device 208, and an error handler device 210. Thread trace device 112 can also include an event packer device 212, a memory buffer controller device 214, an event flow controller device 216, an event memory 218, and interconnections connecting the devices such as, but not limited to, a communication bus 222. Event interface device 202 provides the interface for thread trace device 112 to receive probe events from other devices of computer system 100.

In one embodiment of the present invention, event interface 202 can be implemented as a set of registers that are updated by signals 224 generated by devices of system 100 and monitored every clock cycle by thread trace device 112 to generate a set of incoming probe events. Event filter 204 filters the incoming probe events based on configuration and/or system performance. For example, user configuration received through configuration device 206 may define that all probe events other than thread-create and thread-terminate events should be filtered out. Event filter device 204 can then drop (i.e., filter out) all incoming probe events except the thread-create and thread-terminate events specified by the user, so that the dropped events are not further processed in thread trace device 112.
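By way of illustration only, the filtering behavior described above can be modeled in software. The following Python sketch is not a description of the hardware; the names EventType, ProbeEvent, and EventFilter are hypothetical and are introduced solely for this example, which assumes that each probe event carries an event type and a thread identifier.

```python
from dataclasses import dataclass
from enum import Enum, auto


class EventType(Enum):
    """Hypothetical event types mirroring the thread events discussed in the text."""
    THREAD_CREATE = auto()
    INSTRUCTION_ISSUE = auto()
    INSTRUCTION_COMPLETE = auto()
    THREAD_TERMINATE = auto()


@dataclass(frozen=True)
class ProbeEvent:
    event_type: EventType
    thread_id: int


class EventFilter:
    """Software model of an event filter; only allowed event types pass through."""

    def __init__(self, allowed):
        self.allowed = set(allowed)

    def accept(self, event):
        return event.event_type in self.allowed

    def apply(self, events):
        # Preserve arrival order; dropped events are simply not forwarded.
        return [e for e in events if self.accept(e)]


incoming = [
    ProbeEvent(EventType.THREAD_CREATE, 1),
    ProbeEvent(EventType.INSTRUCTION_ISSUE, 1),
    ProbeEvent(EventType.INSTRUCTION_COMPLETE, 1),
    ProbeEvent(EventType.THREAD_TERMINATE, 1),
]

# Configuration corresponding to the example in the text: keep only
# thread-create and thread-terminate events.
lifecycle_only = EventFilter({EventType.THREAD_CREATE, EventType.THREAD_TERMINATE})
kept = lifecycle_only.apply(incoming)
```

In this sketch the allowed set plays the role of the user configuration; changing the set at run time would correspond to reconfiguring the filter.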

Configuration device 206 can include an interface, such as a JTAG (IEEE 1149.1 Standard Test Access Port and Boundary-Scan Architecture) interface, that allows a user to activate or deactivate a set of probe event monitors. A user, in this case, can be a human operator or a computer program.

In one embodiment, thread trace device 112 can generate a probe event message from incoming probe events, for example, as part of the processing in event interface device 202. The probe event messages can be, for example, generated in an event memory 218. In another embodiment, the incoming probe events can be received as probe event messages. Probe event messages can have a fixed format or a variable format that is understood by devices in thread trace device 112, and perhaps also by software programs that access the probe event messages stored in memory 106. Timestamper device 208 timestamps probe event messages to be processed. The timestamp can be based on clock cycles since the last reset of thread trace device 112. The timestamp should be of sufficient granularity to detect thread interactions in each particular application, and can be configurable.

In one embodiment, for example, it may be sufficient to maintain only a delta timestamp from the previous event, and thereby reduce the number of bits required to maintain the timestamp in each probe event message. Timestamper device 208 can also insert timestamp messages into the probe event message stream as necessary to maintain a trail of the time.
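The delta timestamp approach can be sketched as follows. This Python example is illustrative only: the 8-bit delta width (DELTA_BITS), the record layout, and the function names encode_deltas and decode_deltas are assumptions made for the example and are not specified in this disclosure. A full timestamp record is emitted at the start of the stream and whenever the gap since the previous event does not fit in the delta field.

```python
DELTA_BITS = 8                      # assumed width of the per-event delta field
MAX_DELTA = (1 << DELTA_BITS) - 1   # largest gap representable as a delta


def encode_deltas(events):
    """Encode (cycle, payload) pairs, sorted by cycle, as delta-stamped records.

    Emits ("TS", absolute_cycle) timestamp records when needed and
    ("EV", delta, payload) records for the events themselves.
    """
    records = []
    last = None
    for cycle, payload in events:
        if last is None or cycle - last > MAX_DELTA:
            # Gap too large (or first event): insert a full timestamp record.
            records.append(("TS", cycle))
            delta = 0
        else:
            delta = cycle - last
        records.append(("EV", delta, payload))
        last = cycle
    return records


def decode_deltas(records):
    """Recover absolute cycle counts from the encoded record stream."""
    events = []
    current = 0
    for record in records:
        if record[0] == "TS":
            current = record[1]
        else:
            current += record[1]
            events.append((current, record[2]))
    return events


trace = [(100, "create"), (105, "issue"), (1000, "complete"), (1003, "terminate")]
encoded = encode_deltas(trace)
decoded = decode_deltas(encoded)
```

With these assumed parameters, each event record needs only an 8-bit time field, at the cost of an occasional full timestamp record when events are far apart.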

Error handler device 210 can include functionality to handle probe events that are missing. For example, error handler device 210 can, by monitoring the probe event message sequence, insert a predetermined marker to indicate the type and content of a missing probe event, such that the application processing the probe events can still make use of the probe event messages. Error handler device 210 may also make available the functionality to attach user data based on each event type to each corresponding probe event message. In some embodiments, error handler device 210 can also compress the event data as appropriate. For example, event data can be compressed according to a scheme that is customized to an application that would subsequently process the event data.
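One possible way to model the insertion of markers for missing probe events is sketched below in Python. The rule used here, that an instruction complete message with no preceding instruction issue message for the same thread and instruction indicates a missing issue event, is an assumption chosen for illustration; the message fields and the function name insert_missing_markers are likewise hypothetical.

```python
def insert_missing_markers(messages):
    """Return a copy of the message stream with markers for missing issue events.

    Each message is a dict with keys "kind", "thread", and "instr".
    A marker is inserted immediately before any "complete" message that has
    no matching earlier "issue" message.
    """
    outstanding = set()   # (thread, instr) pairs issued but not yet completed
    result = []
    for msg in messages:
        key = (msg["thread"], msg["instr"])
        if msg["kind"] == "issue":
            outstanding.add(key)
        elif msg["kind"] == "complete":
            if key not in outstanding:
                # Marker records the type and content of the missing event so
                # that later analysis can tell it apart from observed data.
                result.append({
                    "kind": "marker",
                    "missing": "issue",
                    "thread": msg["thread"],
                    "instr": msg["instr"],
                })
            outstanding.discard(key)
        result.append(msg)
    return result


stream = [
    {"kind": "issue", "thread": 1, "instr": 5},
    {"kind": "complete", "thread": 1, "instr": 5},
    {"kind": "complete", "thread": 1, "instr": 7},  # its issue event was lost
]
marked = insert_missing_markers(stream)
```

Because markers carry their own "kind", a downstream analysis program can either skip them or use them to reconstruct the missing events.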

Event packer device 212 can arrange the probe event messages in event memory 218 to generate a block of event messages that can be efficiently transferred to memory 106 or other memory (not shown) through memory controller 108. Event packer device 212 can also include some of the functionality to compress the event messages as mentioned above. Event packer device 212 packs one or more probe event messages into packed units of probe event data that can be transferred through memory controller 108. A packed unit of probe event data can include one or more timestamped probe event messages embedded with error handling markers as necessary and compressed as appropriate.

Event buffer controller device 214 controls an internal memory buffer 220 through which thread trace device 112 transfers packed units of event data. Internal buffer 220 can be implemented as a first-in-first-out (FIFO) buffer sized to hold multiple packed units of probe events. In some embodiments, internal buffer 220 can be part of event data memory 218. Event data memory 218 can be accessible by many devices within thread trace device 112, including error handler device 210 and event packer device 212.

Event buffer controller device 214 may include functionality to store packed event data in an internal memory buffer 220, to address the packed event data to be stored in local memory 106, and to coordinate the transfer of that event data through memory controller 108.
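The packing and buffering described above can be illustrated with a short Python model. The message layout (a 32-bit timestamp, a 16-bit thread identifier, and a 16-bit event code), the unit size of four messages, and the class name EventPacker are all assumptions made for this sketch; the disclosure does not fix a message format or unit size.

```python
import struct
from collections import deque

MESSAGE_FORMAT = "<IHH"   # assumed layout: timestamp, thread id, event code
MESSAGE_SIZE = struct.calcsize(MESSAGE_FORMAT)   # 8 bytes with this layout
MESSAGES_PER_UNIT = 4     # assumed number of messages per packed unit
UNIT_SIZE = MESSAGE_SIZE * MESSAGES_PER_UNIT


class EventPacker:
    """Accumulates messages and moves full units into a FIFO outgoing buffer."""

    def __init__(self):
        self.current = bytearray()   # stands in for the local event memory
        self.outgoing = deque()      # stands in for the FIFO internal buffer

    def add(self, timestamp, thread_id, event_code):
        self.current += struct.pack(MESSAGE_FORMAT, timestamp, thread_id, event_code)
        if len(self.current) >= UNIT_SIZE:
            # Enough data for one unit: deposit it in the outgoing buffer.
            self.outgoing.append(bytes(self.current))
            self.current.clear()

    def next_unit(self):
        """Remove and return the oldest packed unit, or None if none is ready."""
        return self.outgoing.popleft() if self.outgoing else None


packer = EventPacker()
for cycle in range(9):
    packer.add(cycle, 1, 2)
first_unit = packer.next_unit()
```

After nine messages, two full units have been produced and one message remains in the partially filled unit, which waits for additional probe events before it can be transferred.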

Event flow controller device 216 includes the functionality to receive feedback from memory controller 108 and accordingly adjust the rate at which probe events are processed and output from thread trace device 112 as packed units of event data. For example, if feedback from memory controller 108 indicates that memory accesses in the system are above a predetermined threshold, then event flow controller device 216 can initiate action in thread trace device 112 to have incoming probe events filtered at an increased level so that the rate at which packed event data is transferred to memory from thread trace device 112 is reduced.

Similarly, when feedback from memory controller 108 indicates that memory accesses in the system are below a predetermined threshold, then event flow controller device 216 can initiate action to have incoming probe events filtered at a lowered level so that the rate at which packed event data is transferred to memory from thread trace device 112 may be increased. Event flow controller device 216 allows the trace level to be adjusted dynamically to suit system conditions.

FIG. 3 is an example set of thread events that may be collected during the lifetime of a thread, according to one embodiment of the present invention. A thread create event 302 is generated when a process or a thread is spawned by a process. Using computer system 100 as an example, in general, thread create events 302 originate on control unit 102. A thread create event 302 contains event data elements such as a thread identifier, a thread type, a parent process or thread, and an identifier of the processor upon which the thread was created. At the end of the thread's lifetime, a thread terminate event 308 is issued, usually from the same processor from which the thread create event 302 was issued. However, note that thread create event 302 and thread terminate event 308 may not always originate from the same processor.

Between thread create event 302 and thread terminate event 308 are many instruction issues to accomplish one or more processing tasks. After the thread create event 302, the thread proceeds to step through each of the instructions to be processed in step 303. Instruction issue event 304 is generated each time an instruction is issued, for example, by control unit 102. The issued instructions may be assigned to one or more execution units 104 or other processors. Therefore, for example, instruction issue probes may be present in control unit 102 as well as execution units 104. Instruction issue events should identify data elements such as an instruction identifier, an instruction type, the issuing thread or process, and the assigned execution unit. When the respective execution unit has completed executing the instruction, an instruction complete event 306 is generated. Instruction complete event 306 may be generated by the respective execution unit or processor to which the instruction was assigned.

Probe events corresponding to events 302, 304, 306, and 308 that identify the thread would yield an accurate timeline of the processing of a particular thread. Thread trace device 112 monitors probe events and collects them according to user specified criteria and system requirements. When combined with probe events generated corresponding to other threads that were simultaneously active in the computer system, a substantially complete view of threads and thread interactions of the computer system can be obtained. Such a view can then be used to identify issues, including thread interaction issues.
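As an illustration of how a software program might derive such a view from collected probe event messages, the following Python sketch computes per-thread lifetimes and lists pairs of threads whose lifetimes overlap. The tuple layout (timestamp, thread identifier, event kind) and the function names are hypothetical and are assumed only for this example.

```python
def thread_lifetimes(events):
    """Map each thread id to (create_time, terminate_time).

    events is a list of (timestamp, thread_id, kind) tuples; terminate_time is
    None for threads whose terminate event has not been observed.
    """
    spans = {}
    for timestamp, thread_id, kind in sorted(events):
        if kind == "create":
            spans[thread_id] = [timestamp, None]
        elif kind == "terminate" and thread_id in spans:
            spans[thread_id][1] = timestamp
    return {tid: (start, end) for tid, (start, end) in spans.items()}


def overlapping_pairs(lifetimes):
    """Return pairs of completed threads that were active at the same time."""
    done = sorted((tid, s, e) for tid, (s, e) in lifetimes.items() if e is not None)
    pairs = []
    for i, (a, start_a, end_a) in enumerate(done):
        for b, start_b, end_b in done[i + 1:]:
            # Two intervals overlap when each starts before the other ends.
            if start_a < end_b and start_b < end_a:
                pairs.append((a, b))
    return pairs


events = [
    (0, 1, "create"), (5, 2, "create"), (10, 1, "issue"),
    (20, 1, "terminate"), (25, 3, "create"),
    (30, 2, "terminate"), (40, 3, "terminate"),
]
lifetimes = thread_lifetimes(events)
concurrent = overlapping_pairs(lifetimes)
```

A fuller analysis would also use the instruction issue and instruction complete events, which this sketch ignores, to locate where within the overlapping intervals the threads interacted.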

FIG. 4 is a flowchart 400 illustrating an exemplary sequence of processing probe events in one embodiment of the present invention where an application is executed generating probe events. In step 402, an application process is instantiated or started on a processor, for example, control unit 102 in computer system 100. Each thread spawned by the application process executes independently and may itself spawn other threads.

More specifically, FIG. 4 is an illustration of a probe event sequence when two threads are active. In step 404, threads are spawned generating probe events. Specifically, step 404 a issues a probe event when thread thread-1 is created, and step 404 b separately issues a probe event when thread thread-2 is created. The temporal relationship of steps 404 a and 404 b is based on when each thread is spawned. But the temporal relationship between steps 406 a and 406 b, 408 a and 408 b, etc., depends on additional factors such as the type of instruction and the time taken to execute each instruction.

For each thread, as the execution proceeds generating probe events such as issuing instructions, steps 406 (or specifically, step 406 a in thread-1 and step 406 b in thread-2) and 408 (or specifically, step 408 a in thread-1 and step 408 b in thread-2) are repeated. A set of predetermined event types is configured to trigger probe events. As the processing progresses, such events are encountered by the processing thread in step 406, and in step 408 a probe event is generated for that event if configured to do so. When a thread completes execution in step 410 (or specifically, step 410 a in thread-1 and step 410 b in thread-2), a thread complete probe event may be generated. When all threads spawned by the application have terminated, and all other processing by the application has completed, the application process terminates in step 412.

FIG. 5 is a flowchart 500 illustrating exemplary processing steps in thread trace device 112 when a probe event is received. In step 502, a probe event is received in thread trace device 112, for example, triggered by an event as described in flowchart 400. Probe events may be received by many means. For example, thread trace device 112 may actively poll registers to which probe events are written when they occur, such as, for example, performance counters for various types of thread events spread throughout the computer system 100.

As another example, thread trace device 112 may receive a signal for each probe event that is triggered, such as, for example, a data packet including the information necessary to distinctly identify each probe event. In either case, in step 502, a data packet that is representative of the probe event (i.e., a probe event message) can be generated. For each probe event received in thread trace device 112, a determination is made in filtering step 504 whether to further process the received probe event. For example, due to user configuration or other considerations, it may be determined to filter out all probe events except for thread create events and thread terminate events.

If it is determined that the received probe event is to be further processed, then in step 506 the received probe event message may be timestamped. The timestamp enables precise ordering of the probe events in subsequent analysis. In step 508, error control can be performed on the probe event data. For example, in step 508, missing probe events, particularly those that are necessary for a useful analysis of the system behavior, may be represented in an aggregated manner. Particular probe events that are not received in the thread trace device may be represented with an appropriate error handling marker so that the subsequent analysis can distinguish event data inserted from event data that was actually observed. For example, based on the inserted marker, an application that analyzes the data can recreate some or all of the missing probe events, prior to analysis.

Also in step 508, in some embodiments, some level of compression may be performed in accordance with the requirements of the application that would subsequently access the probe event data for analysis. For example, the timestamp may be truncated to only provide the required granularity for a particular scenario, frequently occurring events may be encoded to make the corresponding probe event messages smaller, or selected events may be aggregated or deleted.

In step 510, the probe event messages are packed. For example, one or more probe event messages may be packed together to make an event data unit of a predetermined size that can be transmitted to memory. The packing of the event data packets may take place in a memory that is local to the thread trace device (e.g., event memory 218 in thread trace device 112). In step 512, after packing each received probe event, a determination is made whether enough data is in the event data unit to be transmitted out. In step 514, if the event data unit is deemed to be sufficient, it is deposited in a buffer (e.g., transferred from event memory 218 to buffer 220 in device 112) to be transmitted out to memory. Otherwise, additional probe events are needed before that event data unit can be stored in the outgoing buffer. The event data unit may also be addressed appropriately to be stored in memory (e.g., for transferring to memory 106).

Once the event data unit is created and deposited in the outgoing memory buffer, a memory controller, for example, memory controller 108 in computer system 100, can retrieve the event data unit and, according to the address specified in the event data unit, transfer it to a memory, such as memory 106. An application that enables the analysis of collected trace data may access the probe event message data stored in memory, in real-time or after the completion of the event generating application.

Such an application or trace event processing module can be implemented in software and executed on control unit 102, another control unit (not shown) in computer system 100, or an external computer (not shown) connected to computer system 100. If internal to computer system 100, then the trace event processing module can access the relevant data in memory 106 through memory controller 108. If the trace event processing module is executed on an external computer, then suitable software should be available on computer system 100 to provide the relevant data from memory 106.

The transfer of probe events to memory consumes system resources. For example, as probe event messages (or event data units) are transferred to memory over communication bus 110, the corresponding memory traffic is increased proportionally to the probe event generation frequency. At some level of probe event generation, the probe event related memory activity may interfere with the system performance and/or thread behavior. For example, in computer system 100, when event data units are transferred to memory 106 by memory controller 108 through communications bus 110, the use of bus 110 by the execution units 104 for accessing data for regular processing purposes may be adversely affected. It is generally highly desirable to ensure that the system monitoring activity does not interfere with the system performance and/or thread behavior.

FIG. 6 is a flowchart 600 that illustrates an exemplary scheme for dynamically controlling the probe event collection to ensure that the probe event collection does not adversely affect system performance. In step 602, the memory controller, such as memory controller 108 of computer system 100, notifies the thread trace device, for example event flow controller device 216 of thread trace device 112, of a status change. By way of example, the change in status reported by the memory controller can include information on the memory traffic level, information that a threshold level of memory traffic has been met, or any other information that enables the thread trace device to throttle or accelerate the probe event collection. For example, if the memory controller reports the current level of memory traffic, in step 604, the thread trace device can determine whether the current level of memory traffic exceeds a predetermined upper threshold. Alternatively, in step 612, the thread trace device can determine whether the reported traffic level is below a predetermined lower threshold.

If the traffic level is found to exceed the predetermined upper threshold, then in step 606, an appropriate filtering level or criteria may be determined. In step 608, the event filter can be adjusted to implement filtering criteria that reduce the memory traffic due to probe events. For example, event flow controller device 216 can cause the event filter device 204 to drop all probe events other than thread create events and thread terminate events. A person skilled in the art will understand that many configuration variations of the filtering device may be made to reduce the number of probe events being transferred to memory. Error handler device 210 may also be notified, in step 608, so that appropriate error handling markers can be inserted into the affected probe event data units.

If, in step 612, it is determined that the memory traffic level has dropped below a predetermined lower threshold, then in step 614, an appropriate decrease in filtering level may be determined. Subsequently, in step 616, the event filter device may be notified to allow more probe events to be processed. For example, event flow controller device 216 may cause the event filter device 204 to allow all probe events. A person skilled in the art will understand that many configuration variations of the filtering device may be made to increase the number of probe events being transferred to memory. Error handler device 210 may also be notified, in step 616, so that appropriate error handling messages can be inserted into the affected probe event data units.
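By way of non-limiting illustration only, the threshold comparisons of flowchart 600 can be sketched in software as follows. The sketch rests on several assumptions that are not part of the description: memory traffic is reported as a fraction of bus utilization, filtering is expressed as an integer level in which 0 admits all probe events, and each status change moves the level by at most one step. The class name `FlowController`, its fields, and the threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlowController:
    """Software model of the upper/lower threshold checks of flowchart 600."""
    upper_threshold: float  # assumed unit: fraction of bus utilization
    lower_threshold: float
    level: int = 0          # 0 admits all probe events (assumption)
    max_level: int = 2      # hypothetical most restrictive filtering level

    def on_status_change(self, traffic: float) -> int:
        """Handle a status change (step 602) and return the filtering level."""
        if traffic > self.upper_threshold:
            # Steps 604 to 608: traffic too high, so filter more aggressively.
            self.level = min(self.level + 1, self.max_level)
        elif traffic < self.lower_threshold:
            # Steps 612 to 616: traffic low, so admit more probe events.
            self.level = max(self.level - 1, 0)
        # Between the two thresholds the current level is left unchanged.
        return self.level


controller = FlowController(upper_threshold=0.8, lower_threshold=0.3)
reports = (0.5, 0.9, 0.95, 0.6, 0.2, 0.1)
levels = [controller.on_status_change(traffic) for traffic in reports]
```

Keeping the two thresholds apart leaves a band in which the filtering level does not change, so traffic that hovers near a single value does not toggle the filter on every report.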

In general, the steps of flowchart 600 when implemented can allow a computer system to operate without its processing activities being significantly affected while simultaneously generating a high level of trace information. In contrast to conventional debugging and monitoring systems in which the level of desired trace information must be specified statically before system startup, the present invention allows the system to dynamically configure itself to obtain the maximum amount of trace information without affecting system performance.

CONCLUSION

Embodiments of the present invention may be used in any computer system or computing device where monitoring of one or more concurrently executing processes or threads is desired. For example and without limitation, embodiments may include computers, game platforms, entertainment platforms, personal digital assistants, and video platforms.

The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventor(s), and thus, are not intended to limit the present invention and the appended claims in any way.

The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.

The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.

The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

What is claimed is:
 1. A method for monitoring performance of a computer system, the method comprising: inserting one or more probes in one or more hardware-based processing units, wherein said probes are configured to generate probe signals in response to detecting predetermined processing events; configuring a hardware-based device to: receive said probe signals from the probes; store probe event messages for the received probe signals in an event memory in the hardware-based device; transfer the probe event messages from the event memory to a memory external to the hardware-based device via a first-in-first-out buffer; regulate a frequency of the transfer of the probe event messages from the event memory based on a feedback; and configuring a memory controller device to: transfer probe event messages from the first-in-first-out buffer to the memory; and provide the feedback to the hardware-based device, wherein the feedback includes a current measure of accesses to the memory.
 2. The method of claim 1, wherein the configuring a hardware-based device comprises: collecting said probe signals; and timestamping said probe event messages, wherein the probe event messages are based on said probe signals.
 3. The method of claim 2, wherein the configuring a hardware-based device further comprises: filtering said probe signals, wherein filtering is based on configurable filter criteria.
 4. The method of claim 2, wherein the configuring a hardware-based device further comprises: compressing said probe event messages.
 5. The method of claim 2, wherein the configuring a hardware-based device further comprises: error processing of said probe event messages.
 6. The method of claim 1, further comprising: accessing said probe event messages in said memory, wherein the accessing is by a software program to analyze the performance of the computer system.
 7. An apparatus for monitoring the performance of a computer system, comprising: one or more processing units; a memory; a connector device connecting the one or more processing units and the memory; one or more probes (i) inserted in at least one of said processing units and (ii) configured to generate probe signals in response to detecting predetermined processing events; a hardware-based thread trace device connected to the connector device, the thread trace device including (i) an event memory, (ii) an event interface configured to receive said probe signals from the probes and to store probe event messages for the received probe signals in the event memory, and (iii) an event memory buffer controller configured to send probe event messages from the event memory to said memory; a first-in-first-out buffer; and a memory controller device configured to: transfer probe event messages from the thread trace device to the memory via the first-in-first-out buffer; and provide feedback to the thread trace device, wherein the feedback includes a current measure of accesses to the memory, wherein the thread trace device is configured to regulate a frequency of the transfer of the probe event messages from the event memory based on the feedback.
 8. The apparatus of claim 7, wherein the thread trace device further comprises: a timestamper device configured to timestamp said probe event messages.
 9. The apparatus of claim 8, wherein the thread trace device further comprises: an event data packing device configured to generate a probe event unit, wherein the probe event unit includes one or more said probe event messages.
 10. The apparatus of claim 7, wherein the thread trace device further comprises: a filtering device configured to filter said received probe signals based on a filter criteria.
 11. The apparatus of claim 10, wherein the filter criteria is configurable.
 12. The apparatus of claim 9, wherein the thread trace device further comprises: an error handler device configured to process said probe event messages for errors.
 13. The apparatus of claim 12, wherein the error handler device is further configured to insert error handling markers into said probe event unit.
 14. The apparatus of claim 9, wherein the event data packing device is further configured to compress said probe event unit.
 15. The apparatus of claim 7, further comprising: a trace event processing module configured to access probe event messages in said memory.
 16. The apparatus of claim 7, wherein said selected processing events include thread events. 